This dataset records various body measurements for 252 men, along with each man's percentage of body fat as determined by underwater weighing.
Body fat is an important health measurement. However, measuring it accurately is inconvenient and costly, so simpler and cheaper methods of estimating it are desirable. Body fat can be estimated from tables using age and various skin-fold measurements obtained with a caliper. Other estimates can be obtained from predictive equations using body circumference measurements (e.g. abdominal circumference) and/or skin-fold measurements.
Download the dataset from here or from bcourses. Make sure it is in your current working directory. If not, move the file or change your working directory with setwd().
setwd("~/Dropbox/Berkeley/151A/")
bodyfat <- read.csv("Bodyfat.csv")
head(bodyfat)
## Density bodyfat Age Weight Height Neck Chest Abdomen Hip Thigh Knee
## 1 1.0708 12.3 23 154.25 67.75 36.2 93.1 85.2 94.5 59.0 37.3
## 2 1.0853 6.1 22 173.25 72.25 38.5 93.6 83.0 98.7 58.7 37.3
## 3 1.0414 25.3 22 154.00 66.25 34.0 95.8 87.9 99.2 59.6 38.9
## 4 1.0751 10.4 26 184.75 72.25 37.4 101.8 86.4 101.2 60.1 37.3
## 5 1.0340 28.7 24 184.25 71.25 34.4 97.3 100.0 101.9 63.2 42.2
## 6 1.0502 20.9 24 210.25 74.75 39.0 104.5 94.4 107.8 66.0 42.0
## Ankle Biceps Forearm Wrist
## 1 21.9 32.0 27.4 17.1
## 2 23.4 30.5 28.9 18.2
## 3 24.0 28.8 25.2 16.6
## 4 22.8 32.4 29.4 18.2
## 5 24.0 32.2 27.7 17.7
## 6 25.6 35.7 30.6 18.8
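When the file is not in the working directory, a check that reports where R is looking can make the problem easier to spot. The following is a minimal sketch of such a check. The helper name read_csv_checked is hypothetical and not part of the original lab, and the demonstration writes a small temporary file (three columns of the first two rows shown above) rather than assuming Bodyfat.csv is present.

```r
# Hypothetical helper (an assumption, not from the lab): read a CSV only if it
# exists, otherwise stop and report the current working directory.
read_csv_checked <- function(filename) {
  if (!file.exists(filename)) {
    stop("Cannot find '", filename, "' (working directory: ", getwd(),
         "); move the file or call setwd() first.")
  }
  read.csv(filename)
}

# Demonstration on a temporary file so the example runs anywhere.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Density,bodyfat,Age",
             "1.0708,12.3,23",
             "1.0853,6.1,22"), tmp)
demo <- read_csv_checked(tmp)
dim(demo)    # number of rows and columns that were read
names(demo)  # column names taken from the header line
```

With the real file in place, the same helper would be called as read_csv_checked("Bodyfat.csv").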
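To understand what information the data is giving us, we need to understand what the variable names mean. From the data description, we find that the variables are density, percent body fat (bodyfat), age, weight, height, and the circumferences of the neck, chest, abdomen, hip, thigh, knee, ankle, biceps, forearm, and wrist.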
It is always a good idea to take a look at your data to make sure everything looks reasonable (variable names make sense, measurement units are correct, there are no absurd outliers).
summary(bodyfat)
## Density bodyfat Age Weight
## Min. :0.995 Min. : 0.00 Min. :22.00 Min. :118.5
## 1st Qu.:1.041 1st Qu.:12.47 1st Qu.:35.75 1st Qu.:159.0
## Median :1.055 Median :19.20 Median :43.00 Median :176.5
## Mean :1.056 Mean :19.15 Mean :44.88 Mean :178.9
## 3rd Qu.:1.070 3rd Qu.:25.30 3rd Qu.:54.00 3rd Qu.:197.0
## Max. :1.109 Max. :47.50 Max. :81.00 Max. :363.1
## Height Neck Chest Abdomen
## Min. :29.50 Min. :31.10 Min. : 79.30 Min. : 69.40
## 1st Qu.:68.25 1st Qu.:36.40 1st Qu.: 94.35 1st Qu.: 84.58
## Median :70.00 Median :38.00 Median : 99.65 Median : 90.95
## Mean :70.15 Mean :37.99 Mean :100.82 Mean : 92.56
## 3rd Qu.:72.25 3rd Qu.:39.42 3rd Qu.:105.38 3rd Qu.: 99.33
## Max. :77.75 Max. :51.20 Max. :136.20 Max. :148.10
## Hip Thigh Knee Ankle
## Min. : 85.0 Min. :47.20 Min. :33.00 Min. :19.1
## 1st Qu.: 95.5 1st Qu.:56.00 1st Qu.:36.98 1st Qu.:22.0
## Median : 99.3 Median :59.00 Median :38.50 Median :22.8
## Mean : 99.9 Mean :59.41 Mean :38.59 Mean :23.1
## 3rd Qu.:103.5 3rd Qu.:62.35 3rd Qu.:39.92 3rd Qu.:24.0
## Max. :147.7 Max. :87.30 Max. :49.10 Max. :33.9
## Biceps Forearm Wrist
## Min. :24.80 Min. :21.00 Min. :15.80
## 1st Qu.:30.20 1st Qu.:27.30 1st Qu.:17.60
## Median :32.05 Median :28.70 Median :18.30
## Mean :32.27 Mean :28.66 Mean :18.23
## 3rd Qu.:34.33 3rd Qu.:30.00 3rd Qu.:18.80
## Max. :45.00 Max. :34.90 Max. :21.40
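The summary above contains a few values worth a second look, such as a minimum Height of 29.50 and a minimum bodyfat of 0.00. One way to pull out the rows behind such values is sketched below. The helper rows_outside and the plausibility bounds are assumptions chosen for illustration, and the toy data frame is not the real data: rows 3 and 4 are made up and only echo the two minimum values from the summary.

```r
# Hypothetical helper: return the rows whose value in `column` lies outside
# [lower, upper]. The bounds are the analyst's judgment, not part of the data.
rows_outside <- function(df, column, lower, upper) {
  bad <- which(df[[column]] < lower | df[[column]] > upper)
  df[bad, , drop = FALSE]
}

# Toy data: rows 1-2 reuse values printed by head(); rows 3-4 are made up.
toy <- data.frame(Height  = c(67.75, 72.25, 29.50, 72.25),
                  bodyfat = c(12.3, 6.1, 25.3, 0.0))

rows_outside(toy, "Height",  lower = 55, upper = 85)  # flags the 29.5 row
rows_outside(toy, "bodyfat", lower = 2,  upper = 60)  # flags the 0.0 row
```

On the full data, the same call with bodyfat in place of toy would list the suspicious observations so they can be checked against the data description.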
hist(bodyfat$Biceps)
boxplot(bodyfat$Biceps)
Boxplots are great for identifying potential outliers. We can investigate the largest biceps measurement further.
outlier <- which.max(bodyfat$Biceps)
bodyfat[outlier,]
## Density bodyfat Age Weight Height Neck Chest Abdomen Hip Thigh Knee
## 39 1.0202 35.2 46 363.15 72.25 51.2 136.2 148.1 147.7 87.3 49.1
## Ankle Biceps Forearm Wrist
## 39 29.6 45 29 21.4
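which.max() returns only the single largest value. To list every point that a boxplot would draw beyond its whiskers, boxplot.stats() can be used; a minimal sketch follows. To keep it self-contained, it uses only the seven Biceps values printed above (the first six rows and row 39), so it illustrates the rule rather than reproducing the full boxplot.

```r
# Biceps values of the first six rows and of row 39, as printed above.
biceps <- c(32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 45.0)

# boxplot.stats() applies the rule boxplot() uses by default: points more than
# 1.5 times the hinge spread beyond the hinges are reported in $out.
bp <- boxplot.stats(biceps)
bp$out                      # values beyond the whiskers
which(biceps %in% bp$out)   # their positions in the vector
```

On the full data, boxplot.stats(bodyfat$Biceps)$out would give the corresponding values for all 252 men.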
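It is sometimes a good idea to remove outliers from data, especially if an observation is an outlier across many different variables. To remove outlier rows, identify the indices of the outliers, then drop them with negative indexing. Note that this returns a new data frame; bodyfat itself is unchanged unless the result is assigned to a name.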
bodyfat[-outlier,]
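Negative indexing has one pitfall: if the index vector is empty, df[-idx, ] drops every row instead of none. A small guard avoids this, sketched here under the assumption of a hypothetical helper named drop_rows. The toy data frame reuses the Biceps and Forearm values of rows 1, 2 and 39 printed above.

```r
# Hypothetical helper: drop the rows in `idx`, or return `df` untouched when
# `idx` is empty (df[-integer(0), ] would return zero rows).
drop_rows <- function(df, idx) {
  if (length(idx) == 0) return(df)
  df[-idx, , drop = FALSE]
}

toy <- data.frame(Biceps  = c(32.0, 30.5, 45.0),
                  Forearm = c(27.4, 28.9, 29.0))

cleaned   <- drop_rows(toy, which.max(toy$Biceps))  # assign to keep the result
unchanged <- drop_rows(toy, integer(0))             # nothing to remove

nrow(toy)        # the original is not modified
nrow(cleaned)    # one row fewer
nrow(unchanged)  # same as the original
```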
Scatterplots are useful to explore the relationship between two variables.
plot(bodyfat$Biceps, bodyfat$Forearm)
We can look at all possible relationships using the pairs function.
pairs(bodyfat)
From these scatterplots, we can see many strong correlations. Weight looks correlated with most of the other variables, and measurements taken at nearby body locations appear correlated with one another.
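The visual impression from the scatterplot matrix can be backed up with numbers using cor(), which returns the matrix of pairwise correlations. The sketch below uses only three columns of the first six rows printed by head(), so the values it produces say nothing reliable about the full dataset; it only shows the shape of the computation.

```r
# Weight, Neck and Abdomen for the first six rows, as printed by head().
first_six <- data.frame(
  Weight  = c(154.25, 173.25, 154.00, 184.75, 184.25, 210.25),
  Neck    = c(36.2, 38.5, 34.0, 37.4, 34.4, 39.0),
  Abdomen = c(85.2, 83.0, 87.9, 86.4, 100.0, 94.4)
)

cor_mat <- cor(first_six)  # pairwise Pearson correlations
round(cor_mat, 2)          # symmetric matrix with 1s on the diagonal
```

On the full data, round(cor(bodyfat), 2) gives one number for each panel of the pairs plot.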
How can we make sense of these many correlations? A coplot is a helpful way to visualize the relationship between two variables, conditioned on a third variable.
We might be interested in predicting bodyfat by neck and abdominal circumference.
pairs(~bodyfat + Neck + Abdomen, data = bodyfat)
Both neck and abdominal circumference seem to be good predictors of bodyfat. Are both variables necessary for prediction, or are they telling us the same information?
coplot(bodyfat ~ Neck | Abdomen, data = bodyfat, rows = 1)
Each panel shows bodyfat against neck circumference for men whose abdominal circumference falls within one range of values. None of the panels shows a clear trend. This indicates that once we know the abdominal circumference, the neck circumference gives us little additional information about bodyfat and does not appear to improve our prediction.
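The conditioning that coplot() performs can also be carried out by hand: co.intervals() returns the overlapping ranges of the conditioning variable, and a correlation can be computed within each range. The sketch below does this on simulated data, not the real measurements. The simulation is an assumption built so that bodyfat depends on Abdomen only while Neck is related to Abdomen, which is the situation the coplot above suggests.

```r
set.seed(1)
n <- 200
# Simulated stand-ins for the real columns (all coefficients are made up).
Abdomen <- rnorm(n, mean = 92, sd = 10)
Neck    <- 20 + 0.2 * Abdomen + rnorm(n, sd = 1.5)   # related to Abdomen
bodyfat <- -35 + 0.6 * Abdomen + rnorm(n, sd = 4)    # depends on Abdomen only
sim <- data.frame(bodyfat, Neck, Abdomen)

# The six overlapping Abdomen ranges coplot() uses with its default settings.
slices <- co.intervals(sim$Abdomen, number = 6, overlap = 0.5)

# Correlation of Neck and bodyfat within each Abdomen range.
within_cor <- apply(slices, 1, function(iv) {
  keep <- sim$Abdomen >= iv[1] & sim$Abdomen <= iv[2]
  cor(sim$Neck[keep], sim$bodyfat[keep])
})

overall_cor <- cor(sim$Neck, sim$bodyfat)  # ignoring Abdomen
round(overall_cor, 2)
round(within_cor, 2)                       # one value per coplot panel
```

Comparing the overall correlation with the within-range correlations is a numerical counterpart to reading the panels of the coplot. Replacing sim with bodyfat applies the same computation to the real data.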